Questions to Address:
In this assignment, we are required to explain the 5 failures listed in failures.json, whose supporting information can be found in the test data.
A failure example found in failures.json:
[
2220, # failure time, Seconds past 00:00
[
["os_021", "Sent_queue"]
] # root causes: each is a [cmdb_id, key_name] pair
]
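As a minimal sketch of reading these entries (assuming failures.json is a list of [failure_time, root_causes] records like the one above), the seconds-past-midnight offsets can be converted into clock times:

```python
import json
from datetime import timedelta

# Assumption: failures.json holds entries of the form shown above.
with open("failures.json") as f:
    failures = json.load(f)

for failure_time, root_causes in failures:
    clock = timedelta(seconds=failure_time)  # e.g. 2220 -> 0:37:00
    print(f"failure at {clock}")
    for cmdb_id, key_name in root_causes:
        print(f"  root cause: {cmdb_id} / {key_name}")
```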
In the example above, something went wrong in the system at 00:37 (2220 seconds past midnight). We need to figure out which specific performance indicators in the system reveal the failure and its corresponding root causes. For instance, we can approach the answer from the following perspectives:
Pay attention to the components in a trace, as they may influence each other through their call relationships.
Example:
cb48b17261295c467617 (37bd717261295c467616)
-> 8073017261295c527619 (37bd717261295c467616)
-> 62e7817261295c589816 (37bd717261295c467616)
-> ... (19 IDs)
-> ... (etc...)
Tip: Follow the `traceId` to obtain a group of related service calls
| | callType | startTime | msgTime | elapsedTime | success | traceId | id | pid | cmdb_id | serviceName | dsName |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CSF | 2020-05-30 00:00:00.082000+08:00 | 2020-05-30 00:00:00.968000+08:00 | 886 | True | 37bd717261295c467616 | 8073017261295c527619 | cb48b17261295c467617 | os_021 | csf_001 | NaN |
| 4 | CSF | 2020-05-30 00:00:00.103000+08:00 | 2020-05-30 00:00:00.264000+08:00 | 160 | True | 37bd717261295c467616 | fca3117261295c679826 | 62e7817261295c589816 | docker_004 | csf_002 | NaN |
| 10 | CSF | 2020-05-30 00:00:00.268000+08:00 | 2020-05-30 00:00:00.440000+08:00 | 171 | True | 37bd717261295c467616 | 15dcd17261295d0a9830 | 62e7817261295c589816 | docker_004 | csf_003 | NaN |
| 12 | CSF | 2020-05-30 00:00:00.441000+08:00 | 2020-05-30 00:00:00.559000+08:00 | 118 | True | 37bd717261295c467616 | d2fdc17261295db99834 | 62e7817261295c589816 | docker_004 | csf_004 | NaN |
| 16 | CSF | 2020-05-30 00:00:00.561000+08:00 | 2020-05-30 00:00:00.614000+08:00 | 53 | True | 37bd717261295c467616 | fe0fa17261295e319838 | 62e7817261295c589816 | docker_004 | csf_005 | NaN |
| 1706877 | FlyRemote | 2020-05-30 00:00:00.094000+08:00 | 2020-05-30 00:00:00.102000+08:00 | 8 | True | 37bd717261295c467616 | 1f47017261295c5e9822 | 62e7817261295c589816 | docker_004 | fly_remote_001 | NaN |
| 2050698 | JDBC | 2020-05-30 00:00:00.183000+08:00 | 2020-05-30 00:00:00.189000+08:00 | 6 | True | 37bd717261295c467616 | b212a17261295cb74404 | f65a417261295cb64401 | docker_007 | NaN | db_003 |
| 2050700 | JDBC | 2020-05-30 00:00:00.192000+08:00 | 2020-05-30 00:00:00.252000+08:00 | 60 | True | 37bd717261295c467616 | 9b86517261295cc04407 | f65a417261295cb64401 | docker_007 | NaN | db_003 |
| 2050706 | JDBC | 2020-05-30 00:00:00.281000+08:00 | 2020-05-30 00:00:00.289000+08:00 | 8 | True | 37bd717261295c467616 | a472c17261295d193626 | ce7b117261295d193623 | docker_008 | NaN | db_003 |
| 2050710 | JDBC | 2020-05-30 00:00:00.291000+08:00 | 2020-05-30 00:00:00.384000+08:00 | 93 | True | 37bd717261295c467616 | 54aa417261295d233629 | ce7b117261295d193623 | docker_008 | NaN | db_003 |
| 2050714 | JDBC | 2020-05-30 00:00:00.387000+08:00 | 2020-05-30 00:00:00.419000+08:00 | 32 | True | 37bd717261295c467616 | 5b62017261295d833634 | e0dfa17261295d823631 | docker_008 | NaN | db_003 |
| 2050718 | JDBC | 2020-05-30 00:00:00.448000+08:00 | 2020-05-30 00:00:00.453000+08:00 | 5 | True | 37bd717261295c467616 | 7a76117261295dc04492 | 5847c17261295dc04489 | docker_007 | NaN | db_003 |
| 2050721 | JDBC | 2020-05-30 00:00:00.456000+08:00 | 2020-05-30 00:00:00.524000+08:00 | 68 | True | 37bd717261295c467616 | 73a0517261295dc84497 | 5847c17261295dc04489 | docker_007 | NaN | db_003 |
| 2050726 | JDBC | 2020-05-30 00:00:00.529000+08:00 | 2020-05-30 00:00:00.538000+08:00 | 9 | True | 37bd717261295c467616 | 8ff5417261295e114514 | 0f31817261295e0f4509 | docker_007 | NaN | db_003 |
| 2050730 | JDBC | 2020-05-30 00:00:00.574000+08:00 | 2020-05-30 00:00:00.583000+08:00 | 9 | True | 37bd717261295c467616 | f65c017261295e3e4530 | 12b3917261295e3d4527 | docker_007 | NaN | db_003 |
| 2050734 | JDBC | 2020-05-30 00:00:00.584000+08:00 | 2020-05-30 00:00:00.594000+08:00 | 10 | True | 37bd717261295c467616 | af64017261295e484543 | 12b3917261295e3d4527 | docker_007 | NaN | db_003 |
| 2050738 | JDBC | 2020-05-30 00:00:00.615000+08:00 | 2020-05-30 00:00:00.616000+08:00 | 1 | True | 37bd717261295c467616 | ef07117261295e679851 | e32fb17261295e679848 | docker_004 | NaN | db_009 |
| 2050739 | JDBC | 2020-05-30 00:00:00.616000+08:00 | 2020-05-30 00:00:00.617000+08:00 | 1 | True | 37bd717261295c467616 | a84b317261295e689854 | e32fb17261295e679848 | docker_004 | NaN | db_009 |
| 2050740 | JDBC | 2020-05-30 00:00:00.619000+08:00 | 2020-05-30 00:00:00.620000+08:00 | 1 | True | 37bd717261295c467616 | bda4c17261295e6b9859 | 8282817261295e699856 | docker_004 | NaN | db_009 |
| 2050743 | JDBC | 2020-05-30 00:00:00.622000+08:00 | 2020-05-30 00:00:00.622000+08:00 | 0 | True | 37bd717261295c467616 | 75a6917261295e6e9864 | 27dbc17261295e6c9861 | docker_004 | NaN | db_009 |
| 2050746 | JDBC | 2020-05-30 00:00:00.624000+08:00 | 2020-05-30 00:00:00.625000+08:00 | 0 | True | 37bd717261295c467616 | 6e03d17261295e709869 | bdd2a17261295e6f9866 | docker_004 | NaN | db_009 |
| 2050747 | JDBC | 2020-05-30 00:00:00.626000+08:00 | 2020-05-30 00:00:00.627000+08:00 | 1 | True | 37bd717261295c467616 | 87d6417261295e729879 | c7c0817261295e719873 | docker_004 | NaN | db_009 |
| 2050753 | JDBC | 2020-05-30 00:00:00.630000+08:00 | 2020-05-30 00:00:00.630000+08:00 | 0 | True | 37bd717261295c467616 | 836f117261295e769892 | 91d0417261295e739886 | docker_004 | NaN | db_009 |
| 2050759 | JDBC | 2020-05-30 00:00:00.633000+08:00 | 2020-05-30 00:00:00.633000+08:00 | 0 | True | 37bd717261295c467616 | ba88317261295e799904 | 8145f17261295e779896 | docker_004 | NaN | db_009 |
| 2050771 | JDBC | 2020-05-30 00:00:00.651000+08:00 | 2020-05-30 00:00:00.652000+08:00 | 1 | True | 37bd717261295c467616 | 0595717261295e8b9914 | c96ae17261295e7a9909 | docker_004 | NaN | db_009 |
| 2050774 | JDBC | 2020-05-30 00:00:00.655000+08:00 | 2020-05-30 00:00:00.656000+08:00 | 1 | True | 37bd717261295c467616 | 6696d17261295e8f9927 | a8bc817261295e8d9919 | docker_004 | NaN | db_009 |
| 2050775 | JDBC | 2020-05-30 00:00:00.658000+08:00 | 2020-05-30 00:00:00.658000+08:00 | 0 | True | 37bd717261295c467616 | c49b317261295e929937 | 0efe517261295e909931 | docker_004 | NaN | db_009 |
| 2050777 | JDBC | 2020-05-30 00:00:00.660000+08:00 | 2020-05-30 00:00:00.661000+08:00 | 1 | True | 37bd717261295c467616 | fd96b17261295e949947 | 09fce17261295e939941 | docker_004 | NaN | db_009 |
| 2050780 | JDBC | 2020-05-30 00:00:00.664000+08:00 | 2020-05-30 00:00:00.664000+08:00 | 0 | True | 37bd717261295c467616 | 0448517261295e989959 | 6bdec17261295e969954 | docker_004 | NaN | db_009 |
| 2050782 | JDBC | 2020-05-30 00:00:00.666000+08:00 | 2020-05-30 00:00:00.666000+08:00 | 0 | True | 37bd717261295c467616 | a39cc17261295e9a9969 | f5a3e17261295e989964 | docker_004 | NaN | db_009 |
| 2050784 | JDBC | 2020-05-30 00:00:00.667000+08:00 | 2020-05-30 00:00:00.670000+08:00 | 3 | True | 37bd717261295c467616 | 86e3c17261295e9b9977 | 79b8c17261295e9b9974 | docker_004 | NaN | db_007 |
| 2050787 | JDBC | 2020-05-30 00:00:00.671000+08:00 | 2020-05-30 00:00:00.674000+08:00 | 3 | True | 37bd717261295c467616 | 90a8f17261295e9f9995 | 79b8c17261295e9b9974 | docker_004 | NaN | db_007 |
| 10886645 | LOCAL | 2020-05-30 00:00:00.182000+08:00 | 2020-05-30 00:00:00.253000+08:00 | 71 | True | 37bd717261295c467616 | f65a417261295cb64401 | 7bf1417261295c9b4394 | docker_007 | local_method_011 | db_003 |
| 10886649 | LOCAL | 2020-05-30 00:00:00.281000+08:00 | 2020-05-30 00:00:00.385000+08:00 | 104 | True | 37bd717261295c467616 | ce7b117261295d193623 | 5a67717261295d123619 | docker_008 | local_method_013 | db_003 |
| 10886653 | LOCAL | 2020-05-30 00:00:00.386000+08:00 | 2020-05-30 00:00:00.419000+08:00 | 33 | True | 37bd717261295c467616 | e0dfa17261295d823631 | 5a67717261295d123619 | docker_008 | local_method_014 | db_003 |
| 10886657 | LOCAL | 2020-05-30 00:00:00.448000+08:00 | 2020-05-30 00:00:00.524000+08:00 | 76 | True | 37bd717261295c467616 | 5847c17261295dc04489 | c9d2b17261295dbb4485 | docker_007 | local_method_015 | db_003 |
| 10886661 | LOCAL | 2020-05-30 00:00:00.527000+08:00 | 2020-05-30 00:00:00.539000+08:00 | 12 | True | 37bd717261295c467616 | 0f31817261295e0f4509 | c9d2b17261295dbb4485 | docker_007 | local_method_016 | db_003 |
| 10886665 | LOCAL | 2020-05-30 00:00:00.574000+08:00 | 2020-05-30 00:00:00.594000+08:00 | 20 | True | 37bd717261295c467616 | 12b3917261295e3d4527 | 9ecec17261295e344519 | docker_007 | local_method_017 | db_003 |
| 10886669 | LOCAL | 2020-05-30 00:00:00.615000+08:00 | 2020-05-30 00:00:00.617000+08:00 | 2 | True | 37bd717261295c467616 | e32fb17261295e679848 | 62e7817261295c589816 | docker_004 | local_method_001 | db_009 |
| 10886670 | LOCAL | 2020-05-30 00:00:00.617000+08:00 | 2020-05-30 00:00:00.620000+08:00 | 3 | True | 37bd717261295c467616 | 8282817261295e699856 | 62e7817261295c589816 | docker_004 | local_method_002 | db_009 |
| 10886671 | LOCAL | 2020-05-30 00:00:00.620000+08:00 | 2020-05-30 00:00:00.623000+08:00 | 3 | True | 37bd717261295c467616 | 27dbc17261295e6c9861 | 62e7817261295c589816 | docker_004 | local_method_002 | db_009 |
| 10886674 | LOCAL | 2020-05-30 00:00:00.623000+08:00 | 2020-05-30 00:00:00.625000+08:00 | 2 | True | 37bd717261295c467616 | bdd2a17261295e6f9866 | 62e7817261295c589816 | docker_004 | local_method_003 | db_009 |
| 10886677 | LOCAL | 2020-05-30 00:00:00.625000+08:00 | 2020-05-30 00:00:00.627000+08:00 | 2 | True | 37bd717261295c467616 | c7c0817261295e719873 | 62e7817261295c589816 | docker_004 | local_method_004 | db_009 |
| 10886679 | LOCAL | 2020-05-30 00:00:00.627000+08:00 | 2020-05-30 00:00:00.631000+08:00 | 4 | True | 37bd717261295c467616 | 91d0417261295e739886 | 62e7817261295c589816 | docker_004 | local_method_004 | db_009 |
| 10886684 | LOCAL | 2020-05-30 00:00:00.631000+08:00 | 2020-05-30 00:00:00.634000+08:00 | 3 | True | 37bd717261295c467616 | 8145f17261295e779896 | 62e7817261295c589816 | docker_004 | local_method_004 | db_009 |
| 10886690 | LOCAL | 2020-05-30 00:00:00.634000+08:00 | 2020-05-30 00:00:00.653000+08:00 | 19 | True | 37bd717261295c467616 | c96ae17261295e7a9909 | 62e7817261295c589816 | docker_004 | local_method_004 | db_009 |
| 10886702 | LOCAL | 2020-05-30 00:00:00.653000+08:00 | 2020-05-30 00:00:00.656000+08:00 | 3 | True | 37bd717261295c467616 | a8bc817261295e8d9919 | 62e7817261295c589816 | docker_004 | local_method_005 | db_009 |
| 10886705 | LOCAL | 2020-05-30 00:00:00.656000+08:00 | 2020-05-30 00:00:00.659000+08:00 | 3 | True | 37bd717261295c467616 | 0efe517261295e909931 | 62e7817261295c589816 | docker_004 | local_method_006 | db_009 |
| 10886707 | LOCAL | 2020-05-30 00:00:00.659000+08:00 | 2020-05-30 00:00:00.662000+08:00 | 3 | True | 37bd717261295c467616 | 09fce17261295e939941 | 62e7817261295c589816 | docker_004 | local_method_007 | db_009 |
| 10886709 | LOCAL | 2020-05-30 00:00:00.662000+08:00 | 2020-05-30 00:00:00.664000+08:00 | 2 | True | 37bd717261295c467616 | 6bdec17261295e969954 | 62e7817261295c589816 | docker_004 | local_method_006 | db_009 |
| 10886711 | LOCAL | 2020-05-30 00:00:00.664000+08:00 | 2020-05-30 00:00:00.667000+08:00 | 3 | True | 37bd717261295c467616 | f5a3e17261295e989964 | 62e7817261295c589816 | docker_004 | local_method_009 | db_009 |
| 10886713 | LOCAL | 2020-05-30 00:00:00.667000+08:00 | 2020-05-30 00:00:00.674000+08:00 | 7 | True | 37bd717261295c467616 | 79b8c17261295e9b9974 | 62e7817261295c589816 | docker_004 | local_method_010 | db_007 |
| 17687689 | OSB | 2020-05-30 00:00:00.070000+08:00 | 2020-05-30 00:00:00.993000+08:00 | 922 | True | 37bd717261295c467616 | cb48b17261295c467617 | None | os_021 | osb_001 | NaN |
| 18033277 | RemoteProcess | 2020-05-30 00:00:00.088000+08:00 | 2020-05-30 00:00:00.679000+08:00 | 591 | True | 37bd717261295c467616 | 62e7817261295c589816 | 8073017261295c527619 | docker_004 | csf_001 | NaN |
| 18033281 | RemoteProcess | 2020-05-30 00:00:00.155000+08:00 | 2020-05-30 00:00:00.262000+08:00 | 107 | True | 37bd717261295c467616 | 7bf1417261295c9b4394 | fca3117261295c679826 | docker_007 | csf_002 | NaN |
| 18033288 | RemoteProcess | 2020-05-30 00:00:00.274000+08:00 | 2020-05-30 00:00:00.438000+08:00 | 164 | True | 37bd717261295c467616 | 5a67717261295d123619 | 15dcd17261295d0a9830 | docker_008 | csf_003 | NaN |
| 18033289 | RemoteProcess | 2020-05-30 00:00:00.443000+08:00 | 2020-05-30 00:00:00.558000+08:00 | 115 | True | 37bd717261295c467616 | c9d2b17261295dbb4485 | d2fdc17261295db99834 | docker_007 | csf_004 | NaN |
| 18033293 | RemoteProcess | 2020-05-30 00:00:00.564000+08:00 | 2020-05-30 00:00:00.613000+08:00 | 49 | True | 37bd717261295c467616 | 9ecec17261295e344519 | fe0fa17261295e319838 | docker_007 | csf_005 | NaN |
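Since the parent-child relationships are encoded by the id and pid columns within a traceId, a small helper can rebuild the call tree shown above. This is a sketch, assuming the trace CSV has been loaded into a pandas DataFrame `df` with the columns in the table:

```python
import pandas as pd

def print_trace_tree(df: pd.DataFrame, trace_id: str) -> None:
    """Print one trace's spans as an indented call tree, following id -> pid."""
    spans = df[df["traceId"] == trace_id]
    by_parent = spans.groupby("pid")  # parent span id -> child spans

    def walk(span_id, depth):
        row = spans.loc[spans["id"] == span_id].iloc[0]
        print("  " * depth + f"{span_id} ({row['callType']}, {row['cmdb_id']}, {row['elapsedTime']}ms)")
        if span_id in by_parent.groups:
            for child_id in by_parent.get_group(span_id).sort_values("startTime")["id"]:
                walk(child_id, depth + 1)

    # The root span has no parent; depending on parsing, pid may be NaN or the string "None".
    root = spans[spans["pid"].isna() | (spans["pid"] == "None")].iloc[0]
    walk(root["id"], 0)
```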
cmdb_id

- docker_001 to docker_008
- os_021 to os_022

Therefore, for this assignment we will, as much as possible, associate the failures with these cmdb_ids.
| cmdb_id | count |
|---|---|
| docker_002 | 3065368 |
| docker_003 | 3061799 |
| docker_004 | 3060144 |
| docker_001 | 3058418 |
| docker_005 | 1703686 |
| docker_007 | 1701104 |
| docker_008 | 1700397 |
| docker_006 | 1699433 |
| os_022 | 344828 |
| os_021 | 344788 |
cmdb_ids have different callTypes

- docker_001 to docker_004 contain: JDBC, LOCAL, CSF, FlyRemote, RemoteProcess
- docker_005 to docker_008 contain: JDBC, LOCAL, RemoteProcess
- os_021 to os_022 contain: OSB, CSF

Note that this merely means that these are the services (callType) under the host.
It does not mean that a cmdb_id will not call other cmdb_ids (refer to the example in Point #1, where processes apart from os_021 are called too).
| cmdb_id | callType | count |
|---|---|---|
| docker_001 | JDBC | 1357451 |
| | LOCAL | 1188712 |
| | CSF | 340455 |
| | FlyRemote | 85900 |
| | RemoteProcess | 85900 |
| docker_002 | JDBC | 1361103 |
| | LOCAL | 1191050 |
| | CSF | 341129 |
| | FlyRemote | 86043 |
| | RemoteProcess | 86043 |
| docker_003 | JDBC | 1359637 |
| | LOCAL | 1189566 |
| | CSF | 340687 |
| | RemoteProcess | 85964 |
| | FlyRemote | 85945 |
| docker_004 | JDBC | 1358624 |
| | LOCAL | 1189076 |
| | CSF | 340578 |
| | FlyRemote | 85933 |
| | RemoteProcess | 85933 |
| docker_005 | JDBC | 851049 |
| | LOCAL | 511518 |
| | RemoteProcess | 341119 |
| docker_006 | JDBC | 848925 |
| | LOCAL | 510043 |
| | RemoteProcess | 340465 |
| docker_007 | JDBC | 849753 |
| | LOCAL | 510719 |
| | RemoteProcess | 340632 |
| docker_008 | JDBC | 849405 |
| | LOCAL | 510360 |
| | RemoteProcess | 340632 |
| os_021 | OSB | 172798 |
| | CSF | 171990 |
| os_022 | OSB | 172790 |
| | CSF | 172038 |
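A sketch of how this breakdown (and the ratios in the next point) can be computed, assuming the trace data is already loaded into a pandas DataFrame `df`:

```python
# Count spans per (cmdb_id, callType), then normalise within each host.
counts = df.groupby(["cmdb_id", "callType"]).size().rename("count")
totals = counts.groupby(level="cmdb_id").transform("sum")
ratios = (100 * counts / totals).round(2).rename("ratio")
print(counts)
print(ratios)
```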
callType ratios for different processes

- docker_001 to docker_004 ratio: JDBC = 44%, LOCAL = 38%, CSF = 11%, FlyRemote = 2%, RemoteProcess = 2%
- docker_005 to docker_008 ratio: JDBC = 50%, LOCAL = 30%, RemoteProcess = 20%
- os_021 to os_022 ratio: OSB = 50%, CSF = 50%

These could suggest a certain mapping amongst the service calls. For example, we suspect that OSB and CSF are closely related in a 1-to-1 mapping.
| cmdb_id | callType | ratio |
|---|---|---|
| docker_001 | JDBC | 44.38% |
| | LOCAL | 38.87% |
| | CSF | 11.13% |
| | FlyRemote | 2.81% |
| | RemoteProcess | 2.81% |
| docker_002 | JDBC | 44.4% |
| | LOCAL | 38.86% |
| | CSF | 11.13% |
| | FlyRemote | 2.81% |
| | RemoteProcess | 2.81% |
| docker_003 | JDBC | 44.41% |
| | LOCAL | 38.85% |
| | CSF | 11.13% |
| | RemoteProcess | 2.81% |
| | FlyRemote | 2.81% |
| docker_004 | JDBC | 44.4% |
| | LOCAL | 38.86% |
| | CSF | 11.13% |
| | FlyRemote | 2.81% |
| | RemoteProcess | 2.81% |
| docker_005 | JDBC | 49.95% |
| | LOCAL | 30.02% |
| | RemoteProcess | 20.02% |
| docker_006 | JDBC | 49.95% |
| | LOCAL | 30.01% |
| | RemoteProcess | 20.03% |
| docker_007 | JDBC | 49.95% |
| | LOCAL | 30.02% |
| | RemoteProcess | 20.02% |
| docker_008 | JDBC | 49.95% |
| | LOCAL | 30.01% |
| | RemoteProcess | 20.03% |
| os_021 | OSB | 50.12% |
| | CSF | 49.88% |
| os_022 | OSB | 50.11% |
| | CSF | 49.89% |
serviceName

Note the different service names available for each callType and each cmdb_id.

For docker_001 to docker_004:

- JDBC -> No serviceName
- LOCAL -> local_method_001 to local_method_010
- CSF -> csf_002 to csf_005
- FlyRemote -> fly_remote_001
- RemoteProcess -> csf_001

For docker_005 to docker_008:

- JDBC -> No serviceName
- LOCAL -> local_method_011, local_method_013 to local_method_017
- RemoteProcess -> csf_002 to csf_005

For os_021 to os_022:

- CSF -> csf_001
- OSB -> osb_001

| cmdb_id | callType | serviceName | count |
|---|---|---|---|
| docker_001 | CSF | csf_002 | 85724 |
| | | csf_003 | 84913 |
| | | csf_004 | 84910 |
| | | csf_005 | 84908 |
| | FlyRemote | fly_remote_001 | 85900 |
| | LOCAL | local_method_004 | 339632 |
| | | local_method_002 | 169816 |
| | | local_method_006 | 169816 |
| | | local_method_001 | 84908 |
| | | local_method_003 | 84908 |
| | | local_method_005 | 84908 |
| | | local_method_007 | 84908 |
| | | local_method_009 | 84908 |
| | | local_method_010 | 84908 |
| | RemoteProcess | csf_001 | 85900 |
| docker_002 | CSF | csf_002 | 85893 |
| | | csf_003 | 85081 |
| | | csf_004 | 85078 |
| | | csf_005 | 85077 |
| | FlyRemote | fly_remote_001 | 86043 |
| | LOCAL | local_method_004 | 340300 |
| | | local_method_002 | 170150 |
| | | local_method_006 | 170150 |
| | | local_method_001 | 85075 |
| | | local_method_003 | 85075 |
| | | local_method_005 | 85075 |
| | | local_method_007 | 85075 |
| | | local_method_009 | 85075 |
| | | local_method_010 | 85075 |
| | RemoteProcess | csf_001 | 86043 |
| docker_003 | CSF | csf_002 | 85770 |
| | | csf_003 | 84974 |
| | | csf_004 | 84972 |
| | | csf_005 | 84971 |
| | FlyRemote | fly_remote_001 | 85945 |
| | LOCAL | local_method_004 | 339876 |
| | | local_method_002 | 169938 |
| | | local_method_006 | 169938 |
| | | local_method_001 | 84969 |
| | | local_method_003 | 84969 |
| | | local_method_005 | 84969 |
| | | local_method_007 | 84969 |
| | | local_method_009 | 84969 |
| | | local_method_010 | 84969 |
| | RemoteProcess | csf_001 | 85964 |
| docker_004 | CSF | csf_002 | 85768 |
| | | csf_003 | 84939 |
| | | csf_004 | 84937 |
| | | csf_005 | 84934 |
| | FlyRemote | fly_remote_001 | 85933 |
| | LOCAL | local_method_004 | 339736 |
| | | local_method_002 | 169868 |
| | | local_method_006 | 169868 |
| | | local_method_001 | 84934 |
| | | local_method_003 | 84934 |
| | | local_method_005 | 84934 |
| | | local_method_007 | 84934 |
| | | local_method_009 | 84934 |
| | | local_method_010 | 84934 |
| | RemoteProcess | csf_001 | 85933 |
| docker_005 | LOCAL | local_method_011 | 85891 |
| | | local_method_015 | 85369 |
| | | local_method_016 | 85368 |
| | | local_method_013 | 85031 |
| | | local_method_014 | 85031 |
| | | local_method_017 | 84828 |
| | RemoteProcess | csf_002 | 85891 |
| | | csf_004 | 85369 |
| | | csf_003 | 85031 |
| | | csf_005 | 84828 |
| docker_006 | LOCAL | local_method_011 | 85726 |
| | | local_method_017 | 85157 |
| | | local_method_013 | 84963 |
| | | local_method_014 | 84960 |
| | | local_method_015 | 84619 |
| | | local_method_016 | 84618 |
| | RemoteProcess | csf_002 | 85726 |
| | | csf_005 | 85157 |
| | | csf_003 | 84963 |
| | | csf_004 | 84619 |
| docker_007 | LOCAL | local_method_011 | 85756 |
| | | local_method_015 | 85197 |
| | | local_method_016 | 85194 |
| | | local_method_013 | 84895 |
| | | local_method_014 | 84893 |
| | | local_method_017 | 84784 |
| | RemoteProcess | csf_002 | 85756 |
| | | csf_004 | 85197 |
| | | csf_003 | 84895 |
| | | csf_005 | 84784 |
| docker_008 | LOCAL | local_method_011 | 85780 |
| | | local_method_017 | 85121 |
| | | local_method_013 | 85018 |
| | | local_method_014 | 85017 |
| | | local_method_015 | 84712 |
| | | local_method_016 | 84712 |
| | RemoteProcess | csf_002 | 85781 |
| | | csf_005 | 85121 |
| | | csf_003 | 85018 |
| | | csf_004 | 84712 |
| os_021 | CSF | csf_001 | 171990 |
| | OSB | osb_001 | 172798 |
| os_022 | CSF | csf_001 | 172038 |
| | OSB | osb_001 | 172790 |
os_021 or os_022

os_021    365022
os_022    365019
Name: cmdb_id, dtype: int64

os_021    172798
os_022    172790
Name: cmdb_id, dtype: int64
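A sketch of the check behind these counts, assuming the trace data is in a DataFrame `df`: the root span of each trace has no parent, and should always sit on os_021 or os_022.

```python
# Root spans are the ones with no parent (pid is null).
roots = df[df["pid"].isna()]
print(roots["cmdb_id"].value_counts())
# Every trace should start from one of the two OS hosts.
assert roots["cmdb_id"].isin(["os_021", "os_022"]).all()
```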
Goal: To figure out the specific performance indicators that reveal each failure and its corresponding root causes, giving the answer from the following perspectives:
Ultimately, the goal is to understand how each failure can be identified (through which metrics) in the given data. This mimics a real-world machine learning project, where we first have to understand the data and obtain the relevant domain knowledge to perform feature selection.
In this case, Assignment 2 is a lead up to the course project, where we will carefully choose the algorithms to apply based on our findings in Assignment 2.
This report's analysis will be structured as follows:
For ESB Indicators
We will plot a comparison graph between the test (with failure) and train (no failure) data across a period of 1 hour (since ESB indicators are recorded at 1 min intervals) to observe the impact of the failure on the business-side metrics. These are the metrics that directly impact, and are of primary concern to, end users (business POV).
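A minimal sketch of such a comparison plot (the DataFrame names `train_esb` / `test_esb` are assumptions; the metric columns are those shown in the ESB tables later):

```python
import matplotlib.pyplot as plt

metrics = ["avg_time", "num", "succee_num", "succee_rate"]
fig, axes = plt.subplots(len(metrics), 1, figsize=(10, 12), sharex=True)
for ax, metric in zip(axes, metrics):
    ax.plot(train_esb["startTime"], train_esb[metric], label="train (no failure)")
    ax.plot(test_esb["startTime"], test_esb[metric], label="test (with failure)")
    ax.set_ylabel(metric)
    ax.legend()
plt.show()
```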
For Host Indicators
We will plot a comparison graph between the test (with failure) and train (no failure) data across specified periods (since host indicators are recorded at different intervals) to observe the impact of the failure on the host machine metrics. These provide a high-level view of the metrics indicating failure and are of primary concern to infrastructure engineers.
For Trace Indicators
We will plot a comparison graph between the test (with failure) and train (no failure) data to observe the impact of the failure on the service request metrics. Depending on the perspective, these may serve both to explain the cause of the failure (developer POV) and to help us identify similar failures in the future (site reliability engineer POV).
However, this is more complicated, and the general steps are summarised as follows:

1. Sample a set of traceIds from both the train and test data
2. Compute the actual (self) time for each span in the sampled traces
3. Aggregate and compare the results per component (cmdb_id)

Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled decreased by half around the failure time period. This unusual behaviour can be noted by comparing the hour against itself, as well as by comparing against the train data (normal data with no failures) across the same time period. This indicates that failure #1 does not cause service calls to fail outright, but can lead to side effects such as longer call times and fewer service requests handled. This phenomenon could potentially be due to the app's error handling or simply due to poorer load balancing.
Given that os_021 is the root cause and that os_021 is a virtual machine hosting the app services, we suspect that the failure of os_021 means there are fewer machines processing the service requests. This aligns with num, which decreased to half of its original level (because 1 out of the 2 OS hosts is down). With a higher load across fewer available machines, this leads to a higher average time to process each service request and fewer service requests processed in the same amount of time.
Analysis:
Given that Sent_queue is the host KPI to observe for failure #1, we plot the following graph - comparing Sent_queue values across 70 mins (corresponding to 14 data points) against the train data for the same time period.
We observe that the Sent_queue value peaked right after the failure time (00:37:00), before returning to normal after some time. Sent_queue represents the number of service requests queued for processing. Combining this observation with our earlier speculation (under ESB Indicators), we can confirm that the failure involves a virtual machine hosting the app services in the distributed system. The failure of os_021 affected load balancing, causing service requests to take longer to process and, thus, fewer service requests to be handled in the same time period.
However, from the graph, we notice that the accumulated requests dropped back to 0 soon after. This could mean either that the failed machine os_021 was fixed soon after, or that the remaining serviceable machines managed to clear the queue. It is possible that the load balancing algorithm adjusted itself, or that the remaining machines scaled up vertically (to handle more requests per machine) or horizontally (by spinning up another temporary machine to help). The exact reason is not part of our analysis in Assignment 2.
In conclusion, we note that a possible failure mode involves the virtual machine hosting the services. Such failures, although they do not affect the service's success rate, do affect its latency and lead to slower execution. To identify these problems in production, we can watch for any drastic / abnormal increase in the operating system's Sent_queue metric.
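A sketch of such a detection rule (`train_host` / `test_host` are assumed names for the host KPI tables, with name, value, timestamp and cmdb_id columns):

```python
# Build a baseline from the failure-free train data, then flag test values
# far above it using a simple 3-sigma rule (an assumption, not the only choice).
mask = (train_host["cmdb_id"] == "os_021") & (train_host["name"] == "Sent_queue")
baseline = train_host.loc[mask, "value"]
threshold = baseline.mean() + 3 * baseline.std()

test_q = test_host[(test_host["cmdb_id"] == "os_021") & (test_host["name"] == "Sent_queue")]
spikes = test_q[test_q["value"] > threshold]
print(spikes[["timestamp", "value"]])
```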
Analysis:
Since we know that os_021 has a failure, we will plot the following graphs - (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022
The graphs' time periods are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.
From the graphs, we can tell that:
- os_021 experienced a distinct increase in elapsed time as compared to the train data and to os_022 (for the same time period)
- os_022 did not experience any change in elapsed time

Finding 1: osb_001 is the parent of csf_001 (thus, osb_001's elapsed time is always larger than csf_001's)
However, given that osb is the inner span while csf is the outer span, osb - csf gives us the actual time spent in osb itself. This prompts us to write code to generate the actual time for each trace, sketched below.
The logic of the code lies in the fact that every trace starts with an inner span followed by 1 or more outer-span children. Therefore, the actual time should be interpreted as: actual time = a span's elapsedTime minus the sum of its direct children's elapsedTime.
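A sketch of this computation, assuming the trace data is in a DataFrame `df` with id, pid and elapsedTime columns:

```python
import pandas as pd

def add_actual_time(df: pd.DataFrame) -> pd.DataFrame:
    """actual_time = a span's elapsedTime minus its direct children's total elapsedTime."""
    child_sum = df.groupby("pid")["elapsedTime"].sum()  # children total, keyed by parent id
    out = df.copy()
    out["actual_time"] = out["elapsedTime"] - out["id"].map(child_sum).fillna(0)
    return out
```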
Subsequently, we sampled 100 traceIds (since 100 is a large enough sample for approximation) from both the train and test data to (see the code sketch further below):

1. Compute the actual time for each span in the sampled traces
2. Average the actual time for each unique identifier (cmdb_id:serviceName:dsName)

From the output of the above step, we notice that:

- os_021:osb_001:nan has the largest difference of 2953.70 (varies with the sample)
- os_021:csf_001:nan has the 2nd largest difference of 1189.69 (varies with the sample)

Therefore, we can conclude that the failure of os_021 can be attributed to the large increase in time taken to execute the osb_001 code. Additionally, since csf_001 is the outer span and child of osb_001, we can conclude that there is a large increase in the time taken to pass control to the other microservices called by osb_001.
What this means to the developer:
Code review should be performed on osb_001 to understand the cause of the increase in time taken to execute the code and to call other microservices. Relating this back to the Host Indicators, we suspect that the code does not handle an increase in requests well (i.e. Sent_queue increased sharply).
What this means to site reliability engineer / ML engineer:
We can potentially detect virtual machine failure by observing the osb_001 and csf_001 metrics for a large increase in execution time.
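A sketch of how a comparison table like the one below can be assembled (`train_df` / `test_df` are assumed names for trace DataFrames that already carry the actual_time column from the earlier sketch):

```python
import pandas as pd

def avg_actual_time(df: pd.DataFrame, n_traces: int = 100) -> pd.Series:
    """Sample n_traces traceIds and average actual_time per cmdb_id:serviceName:dsName."""
    sampled = df["traceId"].drop_duplicates().sample(n_traces, random_state=0)
    sample = df[df["traceId"].isin(sampled)]
    key = (sample["cmdb_id"].astype(str) + ":" +
           sample["serviceName"].astype(str) + ":" +
           sample["dsName"].astype(str))
    return sample.groupby(key)["actual_time"].mean()

compare = pd.concat({"train_actual_time": avg_actual_time(train_df),
                     "test_actual_time": avg_actual_time(test_df)}, axis=1)
compare["difference"] = compare["train_actual_time"] - compare["test_actual_time"]
print(compare.sort_values("difference").head(10))  # most negative = largest test increase
```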
Finding the root cause for the slower processing speed
| | unique_identifier | train_actual_time | test_actual_time | difference |
|---|---|---|---|---|
| 46 | os_021:osb_001:nan | 14.80 | 2941.75 | -2926.95 |
| 45 | os_021:csf_001:nan | 5.80 | 1187.10 | -1181.30 |
| 17 | docker_004:csf_001:nan | 16.54 | 67.60 | -51.06 |
| 0 | docker_003:csf_001:nan | 20.24 | 60.82 | -40.58 |
| 16 | docker_003:nan:db_009 | 7.98 | 27.30 | -19.32 |
| 33 | docker_004:nan:db_009 | 8.12 | 18.12 | -10.00 |
| 9 | docker_003:local_method_004:db_009 | 3.36 | 11.51 | -8.15 |
| 4 | docker_003:csf_005:nan | 3.91 | 10.53 | -6.62 |
| 2 | docker_003:csf_003:nan | 2.36 | 8.74 | -6.38 |
| 19 | docker_004:csf_003:nan | 3.27 | 9.49 | -6.22 |
| 20 | docker_004:csf_004:nan | 2.12 | 7.55 | -5.43 |
| 1 | docker_003:csf_002:nan | 5.83 | 9.80 | -3.97 |
| 5 | docker_003:fly_remote_001:nan | 3.16 | 7.03 | -3.87 |
| 23 | docker_004:local_method_001:db_009 | 0.42 | 4.16 | -3.74 |
| 31 | docker_004:local_method_010:db_007 | 1.55 | 5.24 | -3.69 |
| 22 | docker_004:fly_remote_001:nan | 2.39 | 6.04 | -3.65 |
| 30 | docker_004:local_method_009:db_009 | 1.26 | 4.90 | -3.64 |
| 3 | docker_003:csf_004:nan | 3.38 | 6.97 | -3.59 |
| 18 | docker_004:csf_002:nan | 4.61 | 8.12 | -3.51 |
| 26 | docker_004:local_method_004:db_009 | 2.18 | 5.57 | -3.39 |
| 8 | docker_003:local_method_003:db_009 | 0.99 | 4.20 | -3.21 |
| 32 | docker_004:nan:db_007 | 3.25 | 6.14 | -2.89 |
| 15 | docker_003:nan:db_007 | 2.73 | 5.40 | -2.67 |
| 21 | docker_004:csf_005:nan | 3.52 | 6.00 | -2.48 |
| 12 | docker_003:local_method_007:db_009 | 1.11 | 3.47 | -2.36 |
| 27 | docker_004:local_method_005:db_009 | 0.51 | 2.34 | -1.83 |
| 28 | docker_004:local_method_006:db_009 | 1.92 | 3.65 | -1.73 |
| 6 | docker_003:local_method_001:db_009 | 0.43 | 2.11 | -1.68 |
| 29 | docker_004:local_method_007:db_009 | 0.59 | 2.26 | -1.67 |
| 24 | docker_004:local_method_002:db_009 | 1.98 | 3.39 | -1.41 |
| 10 | docker_003:local_method_005:db_009 | 1.15 | 2.49 | -1.34 |
| 14 | docker_003:local_method_010:db_007 | 2.31 | 3.65 | -1.34 |
| 7 | docker_003:local_method_002:db_009 | 3.47 | 4.79 | -1.32 |
| 13 | docker_003:local_method_009:db_009 | 2.17 | 3.32 | -1.15 |
| 11 | docker_003:local_method_006:db_009 | 2.79 | 3.86 | -1.07 |
| 25 | docker_004:local_method_003:db_009 | 0.48 | 0.57 | -0.09 |
| 34 | docker_006:csf_002:nan | 25.97 | NaN | NaN |
| 35 | docker_006:csf_003:nan | 14.11 | NaN | NaN |
| 36 | docker_006:csf_004:nan | 13.10 | NaN | NaN |
| 37 | docker_006:csf_005:nan | 11.68 | NaN | NaN |
| 38 | docker_006:local_method_011:db_003 | 2.28 | NaN | NaN |
| 39 | docker_006:local_method_013:db_003 | 2.97 | NaN | NaN |
| 40 | docker_006:local_method_014:db_003 | 1.63 | NaN | NaN |
| 41 | docker_006:local_method_015:db_003 | 2.41 | NaN | NaN |
| 42 | docker_006:local_method_016:db_003 | 1.73 | NaN | NaN |
| 43 | docker_006:local_method_017:db_003 | 2.18 | NaN | NaN |
| 44 | docker_006:nan:db_003 | 25.03 | NaN | NaN |
| 47 | docker_007:csf_002:nan | NaN | 13.22 | NaN |
| 48 | docker_007:csf_003:nan | NaN | 7.33 | NaN |
| 49 | docker_007:csf_004:nan | NaN | 6.87 | NaN |
| 50 | docker_007:csf_005:nan | NaN | 5.07 | NaN |
| 51 | docker_007:local_method_011:db_003 | NaN | 1.15 | NaN |
| 52 | docker_007:local_method_013:db_003 | NaN | 1.31 | NaN |
| 53 | docker_007:local_method_014:db_003 | NaN | 0.79 | NaN |
| 54 | docker_007:local_method_015:db_003 | NaN | 1.44 | NaN |
| 55 | docker_007:local_method_016:db_003 | NaN | 0.79 | NaN |
| 56 | docker_007:local_method_017:db_003 | NaN | 0.92 | NaN |
| 57 | docker_007:nan:db_003 | NaN | 12.63 | NaN |
| 58 | docker_008:csf_002:nan | NaN | 13.58 | NaN |
| 59 | docker_008:csf_003:nan | NaN | 6.22 | NaN |
| 60 | docker_008:csf_004:nan | NaN | 5.72 | NaN |
| 61 | docker_008:csf_005:nan | NaN | 5.96 | NaN |
| 62 | docker_008:local_method_011:db_003 | NaN | 1.05 | NaN |
| 63 | docker_008:local_method_013:db_003 | NaN | 0.99 | NaN |
| 64 | docker_008:local_method_014:db_003 | NaN | 1.02 | NaN |
| 65 | docker_008:local_method_015:db_003 | NaN | 1.06 | NaN |
| 66 | docker_008:local_method_016:db_003 | NaN | 0.67 | NaN |
| 67 | docker_008:local_method_017:db_003 | NaN | 0.98 | NaN |
| 68 | docker_008:nan:db_003 | NaN | 13.13 | NaN |
Analysis:
From the 4 line charts, we observe that soon after the failure time (01:56:00) the success rate dipped slightly before dropping drastically. This trend is confirmed by the num and succee_num charts, which show the values dipping slightly (from 01:58:00 to 01:59:00) before experiencing a sharp decline (from 02:02:00 to 02:03:00). Finally, the average time taken to process the service requests also increased slightly before a sharper increase later.
From these indicators alone, we cannot really deduce what the problem is. However, this could be indicative of a failure that propagated to the other microservices and caused them to malfunction. Under this assumption, the small dip in success rate could be due to the initial failure, which was soon fixed. However, the failure could have propagated to other microservices, leading to a domino effect and causing a large number of service requests to fail.
| | serviceName | startTime | avg_time | num | succee_num | succee_rate |
|---|---|---|---|---|---|---|
| 118 | osb_001 | 01:58:00 | 2.1664 | 371 | 370 | 0.9973 |
| 119 | osb_001 | 01:59:00 | 1.1802 | 512 | 493 | 0.9629 |
| 122 | osb_001 | 02:02:00 | 0.7779 | 568 | 23 | 0.0405 |
| 123 | osb_001 | 02:03:00 | 5.1987 | 224 | 118 | 0.5268 |
| 124 | osb_001 | 02:04:00 | 1.6619 | 356 | 355 | 0.9972 |
FYI: From cmdb.xlsx, docker_007 is also mapped to os_019. However, we will only check the host data in dcos_docker.csv
Analysis:
Trying out all possible KPI names for docker_007, we note the following observations:

- container_cpu_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_006) in the test data
- container_mem_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_006) in the test data
- container_fgc -> All values are 0
- container_fgct -> All values are 0
- container_session_used -> All values are 0
- container_thread_idle -> All values are 0
- container_thread_running -> All values are 0
- container_thread_total -> All values are 0
- container_thread_used_pct -> All values are 0

The major dips in the values were mostly observed around 02:06:00. Comparing the graphs against the ESB Indicators, we observe that these large decreases in value happen slightly after the drastic decrease in success rate. Therefore, we suspect that they could be inter-related.
| | itemid | name | bomc_id | timestamp | value | cmdb_id |
|---|---|---|---|---|---|---|
| 8108 | 999999996381307 | container_mem_used | ZJ-004-060 | 02:06:04 | 42.0 | docker_007 |
| 8135 | 999999996381307 | container_mem_used | ZJ-004-060 | 02:06:28 | 50.0 | docker_007 |
| 8216 | 999999996381307 | container_mem_used | ZJ-004-060 | 02:07:41 | 55.0 | docker_007 |
| 8279 | 999999996381307 | container_mem_used | ZJ-004-060 | 02:08:37 | 55.0 | docker_007 |
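A sketch of the KPI sweep described above (`train_host` / `test_host` are assumed names for tables loaded from dcos_docker.csv, with name, value, timestamp and cmdb_id columns):

```python
import matplotlib.pyplot as plt

# Iterate over every KPI name recorded for docker_007 and compare test vs train.
for kpi in test_host.loc[test_host["cmdb_id"] == "docker_007", "name"].unique():
    te = test_host[(test_host["cmdb_id"] == "docker_007") & (test_host["name"] == kpi)]
    if (te["value"] == 0).all():
        print(f"{kpi}: all values are 0")
        continue
    tr = train_host[(train_host["cmdb_id"] == "docker_007") & (train_host["name"] == kpi)]
    plt.figure(figsize=(10, 3))
    plt.plot(tr["timestamp"], tr["value"], label="train")
    plt.plot(te["timestamp"], te["value"], label="test")
    plt.title(f"docker_007 - {kpi}")
    plt.legend()
    plt.show()
```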
Analysis:
Since we know that docker_007 has a failure and that all traceIds begin from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs - (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022
The graphs' time periods are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.
From the graphs, we can tell that:
- os_021 experienced (1) a distinct fluctuation in elapsed time (even dipping negative) as compared to the train data and to os_022 (for the same time period), and (2) higher elapsed time around 01:59:00 and 02:03:00
- os_022 experienced (1) lower call frequency around 01:59:00 and 02:02:00, and (2) higher elapsed time around 02:00:00 and 02:03:00

Finding 1: Most traceIds contain traces from different dockers, e.g. `docker_003`, `docker_004`, `docker_006`, etc.
However, docker_007 behaves more like docker_005 to docker_008 than docker_001 to docker_004. This will shape the logic subsequently, when we compare docker metrics across containers to detect anomalies.
Subsequently, we sampled 100 traceIds (since 100 is a large enough sample for approximation) from both the train and test data to:

1. Compute the average actual time for each unique identifier (cmdb_id:serviceName:dsName)
2. Filter for docker_006, docker_007 and docker_008 in the unique_identifier (since they share similar metrics and trace calls)

From the output graph of the above step, we notice that:

- csf_002:None has the largest difference (varies with the sample)

Therefore, since csf_002 is the outer span, we can conclude that the failure of docker_007 can be attributed to the large increase in time taken to pass control to the other microservices called by csf_002.
What this means to the developer:
Code review should be performed on docker_007:csf_002:None to understand the cause of the increase in time taken to call other microservices.
What this means to site reliability engineer / ML engineer:
We can potentially detect docker failure (for docker_005 to docker_008) by observing the csf_002:None metrics for a large increase in actual time. Alternatively, we can use the trace invocation length as an additional indicator.
Finding the root cause for the slower processing speed
| | unique_identifier | train_actual_time | test_actual_time | difference |
|---|---|---|---|---|
| 0 | docker_003:csf_001:nan | 24.98 | 2052.32 | -2027.34 |
| 17 | docker_004:csf_001:nan | 15.57 | 1390.54 | -1374.97 |
| 45 | os_021:csf_001:nan | 8.25 | 664.86 | -656.61 |
| 26 | docker_004:local_method_004:db_009 | 4.38 | 238.56 | -234.18 |
| 4 | docker_003:csf_005:nan | 2.87 | 233.19 | -230.32 |
| 1 | docker_003:csf_002:nan | 3.58 | 203.80 | -200.22 |
| 9 | docker_003:local_method_004:db_009 | 5.66 | 179.34 | -173.68 |
| 3 | docker_003:csf_004:nan | 3.32 | 149.33 | -146.01 |
| 24 | docker_004:local_method_002:db_009 | 2.10 | 146.53 | -144.43 |
| 18 | docker_004:csf_002:nan | 4.68 | 134.23 | -129.55 |
| 16 | docker_003:nan:db_009 | 7.11 | 127.81 | -120.70 |
| 7 | docker_003:local_method_002:db_009 | 3.66 | 106.97 | -103.31 |
| 33 | docker_004:nan:db_009 | 7.54 | 106.69 | -99.15 |
| 19 | docker_004:csf_003:nan | 4.25 | 92.63 | -88.38 |
| 28 | docker_004:local_method_006:db_009 | 2.21 | 90.27 | -88.06 |
| 6 | docker_003:local_method_001:db_009 | 1.78 | 84.31 | -82.53 |
| 20 | docker_004:csf_004:nan | 4.10 | 84.77 | -80.67 |
| 11 | docker_003:local_method_006:db_009 | 2.24 | 75.86 | -73.62 |
| 31 | docker_004:local_method_010:db_007 | 1.42 | 71.01 | -69.59 |
| 2 | docker_003:csf_003:nan | 3.89 | 70.40 | -66.51 |
| 21 | docker_004:csf_005:nan | 1.89 | 66.99 | -65.10 |
| 14 | docker_003:local_method_010:db_007 | 1.87 | 62.79 | -60.92 |
| 10 | docker_003:local_method_005:db_009 | 0.77 | 47.80 | -47.03 |
| 30 | docker_004:local_method_009:db_009 | 2.91 | 48.47 | -45.56 |
| 29 | docker_004:local_method_007:db_009 | 1.48 | 42.46 | -40.98 |
| 27 | docker_004:local_method_005:db_009 | 1.18 | 42.13 | -40.95 |
| 8 | docker_003:local_method_003:db_009 | 0.75 | 37.83 | -37.08 |
| 12 | docker_003:local_method_007:db_009 | 0.69 | 36.74 | -36.05 |
| 32 | docker_004:nan:db_007 | 3.09 | 38.00 | -34.91 |
| 13 | docker_003:local_method_009:db_009 | 1.38 | 33.58 | -32.20 |
| 25 | docker_004:local_method_003:db_009 | 0.77 | 29.06 | -28.29 |
| 22 | docker_004:fly_remote_001:nan | 3.44 | 30.26 | -26.82 |
| 23 | docker_004:local_method_001:db_009 | 0.45 | 23.39 | -22.94 |
| 5 | docker_003:fly_remote_001:nan | 3.78 | 25.70 | -21.92 |
| 15 | docker_003:nan:db_007 | 4.55 | 26.21 | -21.66 |
| 46 | os_021:osb_001:nan | 18.60 | 34.32 | -15.72 |
| 34 | docker_006:csf_002:nan | 26.38 | NaN | NaN |
| 35 | docker_006:csf_003:nan | 14.16 | NaN | NaN |
| 36 | docker_006:csf_004:nan | 12.94 | NaN | NaN |
| 37 | docker_006:csf_005:nan | 12.24 | NaN | NaN |
| 38 | docker_006:local_method_011:db_003 | 2.15 | NaN | NaN |
| 39 | docker_006:local_method_013:db_003 | 2.61 | NaN | NaN |
| 40 | docker_006:local_method_014:db_003 | 2.13 | NaN | NaN |
| 41 | docker_006:local_method_015:db_003 | 2.95 | NaN | NaN |
| 42 | docker_006:local_method_016:db_003 | 1.80 | NaN | NaN |
| 43 | docker_006:local_method_017:db_003 | 2.24 | NaN | NaN |
| 44 | docker_006:nan:db_003 | 25.62 | NaN | NaN |
| 47 | docker_007:csf_002:nan | NaN | 89.90 | NaN |
| 48 | docker_007:csf_003:nan | NaN | 31.01 | NaN |
| 49 | docker_007:csf_004:nan | NaN | 14.09 | NaN |
| 50 | docker_007:csf_005:nan | NaN | 8.20 | NaN |
| 51 | docker_007:local_method_011:db_003 | NaN | 3.34 | NaN |
| 52 | docker_007:local_method_013:db_003 | NaN | 2.98 | NaN |
| 53 | docker_007:local_method_014:db_003 | NaN | 2.61 | NaN |
| 54 | docker_007:local_method_015:db_003 | NaN | 3.02 | NaN |
| 55 | docker_007:local_method_016:db_003 | NaN | 2.09 | NaN |
| 56 | docker_007:local_method_017:db_003 | NaN | 1.75 | NaN |
| 57 | docker_007:nan:db_003 | NaN | 15.77 | NaN |
| 58 | docker_008:csf_002:nan | NaN | 49.84 | NaN |
| 59 | docker_008:csf_003:nan | NaN | 11.46 | NaN |
| 60 | docker_008:csf_004:nan | NaN | 17.50 | NaN |
| 61 | docker_008:csf_005:nan | NaN | 10.61 | NaN |
| 62 | docker_008:local_method_011:db_003 | NaN | 3.53 | NaN |
| 63 | docker_008:local_method_013:db_003 | NaN | 3.55 | NaN |
| 64 | docker_008:local_method_014:db_003 | NaN | 1.07 | NaN |
| 65 | docker_008:local_method_015:db_003 | NaN | 2.69 | NaN |
| 66 | docker_008:local_method_016:db_003 | NaN | 1.10 | NaN |
| 67 | docker_008:local_method_017:db_003 | NaN | 1.79 | NaN |
| 68 | docker_008:nan:db_003 | NaN | 14.62 | NaN |
Failure #3 Details:
[
12480,
[
["os_017", "Recv_total"],
["os_017", "Send_total"]
]
]
Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled decreased around the failure time period. This unusual behaviour can be noted by comparing the hour against itself, as well as by comparing against the train data (normal data with no failures) across the same time period. This indicates that failure #3 does not cause service calls to fail outright, but can lead to side effects such as longer call times and fewer service requests handled. This phenomenon could potentially be due to the app's error handling or simply due to poorer load balancing.
Given that os_017 is the root cause and that os_017 is a virtual machine hosting the app services (2 dockers, one for each app group), we suspect that the failure of os_017 means there are fewer machines processing the service requests. Interestingly, num decreased to less than 20% of its original level despite os_017 being only 1 of the 4 OSes that each host a pair of dockers. Therefore, we will need to investigate further to understand this phenomenon.
Analysis:
Given that Recv_total and Send_total are the host KPIs to observe for failure #3, we plot the following graph - comparing Recv_total and Send_total values across 60 mins (corresponding to 60 data points) against the train data for the same time period.
We observe that both KPI values experienced a drastic decrease after the failure time (03:28:00), before returning to normal around 03:35:00. Recv_total and Send_total represent the number of service requests received for processing and sent after processing, respectively.
In conclusion, we note that a possible failure mode involves the virtual machine hosting the services. Such failures, although they do not affect the service's success rate, do affect its latency and lead to slower execution. To identify these problems in production, we can watch for any drastic / abnormal drop in the operating system's Recv_total and Send_total metrics.
Analysis:
Since we know that os_017 (which hosts docker_001 and docker_005) has a failure and that all traceIds begin from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs - (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022
The graphs' time periods are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.
From the graphs, we can tell that:
- os_021 did not experience much change in elapsed time (though it has a lower call frequency)
- os_022 experienced higher elapsed time persisting for about 5 mins (from 03:28:00 to 03:33:00)

Similar to Failure #2's Trace Indicators explanation, we will compare specific dockers.
docker_001 behaves more like docker_002 to docker_004, while docker_005 behaves more like docker_006 to docker_008. This will shape the logic subsequently, when we compare docker metrics across containers to detect anomalies.
Subsequently, we sampled 100 traceIds (since 100 is a large enough sample for approximation) from both the train and test data to:

1. Compute the average actual time for each unique identifier (cmdb_id:serviceName:dsName)
2. Filter for docker_001 and docker_002 in the unique_identifier (since they share similar metrics and trace calls)
3. Filter for docker_005, docker_006, docker_007 and docker_008 in the unique_identifier (since they share similar metrics and trace calls)

From the output graph of the above step, we notice that:

- For docker_001, None:db_009 and csf_001:None have the largest differences
- For docker_005, None:db_003 and the csf methods (i.e. from csf_002:None to csf_005:None) have the largest differences

Therefore, since all the metrics identified above are outer spans, we can conclude that the failure of os_017 can be attributed to the large increase in time taken to pass control to other microservices.
What this means to the developer: Code review should be performed on the metrics identified above to understand the cause of the increase in time taken to call other microservices.
What this means to site reliability engineer / ML engineer:
We can potentially detect the failure of an OS that hosts docker containers (i.e. os_017 to os_020) by observing the metrics identified above for a large increase in actual time compared to the train data.
Finding the root cause for the slower processing speed
| | unique_identifier | train_actual_time | test_actual_time | difference |
|---|---|---|---|---|
| 16 | docker_001:nan:db_009 | 11.77 | 3170.99 | -3159.22 |
| 0 | docker_001:csf_001:nan | 37.70 | 2814.41 | -2776.71 |
| 56 | os_022:csf_001:nan | 7.91 | 595.01 | -587.10 |
| 20 | docker_002:csf_004:nan | 8.19 | 576.10 | -567.91 |
| 18 | docker_002:csf_002:nan | 10.74 | 537.44 | -526.70 |
| 19 | docker_002:csf_003:nan | 7.01 | 470.91 | -463.90 |
| 15 | docker_001:nan:db_007 | 4.84 | 442.89 | -438.05 |
| 21 | docker_002:csf_005:nan | 3.08 | 348.05 | -344.97 |
| 2 | docker_001:csf_003:nan | 4.89 | 271.58 | -266.69 |
| 4 | docker_001:csf_005:nan | 8.16 | 248.69 | -240.53 |
| 5 | docker_001:fly_remote_001:nan | 2.64 | 233.06 | -230.42 |
| 3 | docker_001:csf_004:nan | 8.69 | 227.82 | -219.13 |
| 1 | docker_001:csf_002:nan | 6.84 | 224.58 | -217.74 |
| 6 | docker_001:local_method_001:db_009 | 0.58 | 92.43 | -91.85 |
| 57 | os_022:osb_001:nan | 16.12 | 90.10 | -73.98 |
| 17 | docker_002:csf_001:nan | 23.69 | 82.70 | -59.01 |
| 11 | docker_001:local_method_006:db_009 | 2.14 | 4.11 | -1.97 |
| 29 | docker_002:local_method_007:db_009 | 0.55 | 1.41 | -0.86 |
| 22 | docker_002:fly_remote_001:nan | 4.25 | 4.52 | -0.27 |
| 10 | docker_001:local_method_005:db_009 | 2.13 | 2.05 | 0.08 |
| 24 | docker_002:local_method_002:db_009 | 2.39 | 2.30 | 0.09 |
| 8 | docker_001:local_method_003:db_009 | 1.55 | 1.40 | 0.15 |
| 25 | docker_002:local_method_003:db_009 | 0.66 | 0.49 | 0.17 |
| 27 | docker_002:local_method_005:db_009 | 2.00 | 1.27 | 0.73 |
| 23 | docker_002:local_method_001:db_009 | 2.09 | 1.17 | 0.92 |
| 13 | docker_001:local_method_009:db_009 | 1.58 | 0.62 | 0.96 |
| 12 | docker_001:local_method_007:db_009 | 2.74 | 0.64 | 2.10 |
| 30 | docker_002:local_method_009:db_009 | 2.60 | 0.46 | 2.14 |
| 26 | docker_002:local_method_004:db_009 | 5.87 | 3.72 | 2.15 |
| 31 | docker_002:local_method_010:db_007 | 3.30 | 0.87 | 2.43 |
| 7 | docker_001:local_method_002:db_009 | 4.65 | 1.79 | 2.86 |
| 28 | docker_002:local_method_006:db_009 | 4.98 | 1.67 | 3.31 |
| 33 | docker_002:nan:db_009 | 9.95 | 6.46 | 3.49 |
| 9 | docker_001:local_method_004:db_009 | 6.45 | 2.75 | 3.70 |
| 14 | docker_001:local_method_010:db_007 | 5.19 | 0.97 | 4.22 |
| 32 | docker_002:nan:db_007 | 9.49 | 4.58 | 4.91 |
| 34 | docker_007:csf_002:nan | 14.12 | NaN | NaN |
| 35 | docker_007:csf_003:nan | 7.29 | NaN | NaN |
| 36 | docker_007:csf_004:nan | 7.19 | NaN | NaN |
| 37 | docker_007:csf_005:nan | 6.70 | NaN | NaN |
| 38 | docker_007:local_method_011:db_003 | 1.10 | NaN | NaN |
| 39 | docker_007:local_method_013:db_003 | 1.33 | NaN | NaN |
| 40 | docker_007:local_method_014:db_003 | 0.92 | NaN | NaN |
| 41 | docker_007:local_method_015:db_003 | 1.19 | NaN | NaN |
| 42 | docker_007:local_method_016:db_003 | 0.96 | NaN | NaN |
| 43 | docker_007:local_method_017:db_003 | 1.63 | NaN | NaN |
| 44 | docker_007:nan:db_003 | 14.52 | NaN | NaN |
| 45 | docker_008:csf_002:nan | 13.42 | NaN | NaN |
| 46 | docker_008:csf_003:nan | 7.79 | NaN | NaN |
| 47 | docker_008:csf_004:nan | 8.13 | NaN | NaN |
| 48 | docker_008:csf_005:nan | 6.61 | NaN | NaN |
| 49 | docker_008:local_method_011:db_003 | 1.09 | NaN | NaN |
| 50 | docker_008:local_method_013:db_003 | 1.49 | NaN | NaN |
| 51 | docker_008:local_method_014:db_003 | 0.90 | NaN | NaN |
| 52 | docker_008:local_method_015:db_003 | 1.12 | NaN | NaN |
| 53 | docker_008:local_method_016:db_003 | 0.74 | NaN | NaN |
| 54 | docker_008:local_method_017:db_003 | 0.89 | NaN | NaN |
| 55 | docker_008:nan:db_003 | 13.59 | NaN | NaN |
| 58 | docker_005:csf_002:nan | NaN | 1276.10 | NaN |
| 59 | docker_005:csf_003:nan | NaN | 547.34 | NaN |
| 60 | docker_005:csf_004:nan | NaN | 639.52 | NaN |
| 61 | docker_005:csf_005:nan | NaN | 495.06 | NaN |
| 62 | docker_005:local_method_011:db_003 | NaN | 1.37 | NaN |
| 63 | docker_005:local_method_013:db_003 | NaN | 1.51 | NaN |
| 64 | docker_005:local_method_014:db_003 | NaN | 0.89 | NaN |
| 65 | docker_005:local_method_015:db_003 | NaN | 1.64 | NaN |
| 66 | docker_005:local_method_016:db_003 | NaN | 1.08 | NaN |
| 67 | docker_005:local_method_017:db_003 | NaN | 1.18 | NaN |
| 68 | docker_005:nan:db_003 | NaN | 2890.51 | NaN |
| 69 | docker_006:csf_002:nan | NaN | 9.19 | NaN |
| 70 | docker_006:csf_003:nan | NaN | 5.05 | NaN |
| 71 | docker_006:csf_004:nan | NaN | 3.64 | NaN |
| 72 | docker_006:csf_005:nan | NaN | 4.70 | NaN |
| 73 | docker_006:local_method_011:db_003 | NaN | 0.64 | NaN |
| 74 | docker_006:local_method_013:db_003 | NaN | 0.71 | NaN |
| 75 | docker_006:local_method_014:db_003 | NaN | 0.51 | NaN |
| 76 | docker_006:local_method_015:db_003 | NaN | 0.58 | NaN |
| 77 | docker_006:local_method_016:db_003 | NaN | 0.41 | NaN |
| 78 | docker_006:local_method_017:db_003 | NaN | 0.77 | NaN |
| 79 | docker_006:nan:db_003 | NaN | 10.50 | NaN |
Failure #4 Details:
[
16080,
[
["db_003", "On_Off_State"],
["db_003", "tnsping_result_time"]
]
]
Analysis:
From the 4 line charts, we observe that soon after the failure time (04:28:00) the success rate dropped drastically twice in succession, with the first occurrence taking longer to recover than the second. This trend results from fewer successful calls (the succee_num graph shares a similar pattern with the success rate) relative to num. On the other hand, the average time to process the service requests fluctuated after the failure time, dipping and then increasing sharply.
The above pattern repeated itself once more soon after the system recovered.
From these indicators, we suspect that a fault in 1 of the microservices led to service requests returning faster than before, in the form of failed requests. This is supported by the shorter average time accompanied by a lower success rate. On the other hand, the peak in average time can be explained as the "cold start" of reconnecting to the microservice that had just recovered. Nonetheless, these are speculations that need to be confirmed through other indicators.
Analysis:
Given that On_Off_State and tnsping_result_time are the host KPIs to observe for failure #4, we plot the following graph - comparing On_Off_State and tnsping_result_time values across 60 mins (corresponding to 60 data points) against the train data for the same time period.
We observe that On_Off_State was toggled from 1 (indicating "on") to 0 (indicating "off") during the failure time. At the same time, tnsping_result_time increased sharply to 9,999 ms (which is likely the timeout duration). These indicate that the database was "switched off" / unavailable during the time of failure and it led to timeout errors when attempting to connect to the affected database.
In conclusion, we note that a possible failure mode involves a database that the services require. Such failures affect the service's success rate, and also its latency while the distributed system attempts to work around the problem (e.g. by rerouting to another database). To identify these problems in production, we can watch the database's On_Off_State and tnsping_result_time metrics for drastic / abnormal changes. Specifically, we can watch for the time periods when On_Off_State == 0 and tnsping_result_time == 9999.
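A sketch of this detection rule (`test_host` is an assumed name for the table holding the DB KPI rows, with name, value, timestamp and cmdb_id columns):

```python
# Flag the periods when the database is reported off and tnsping hits the timeout.
db = test_host[test_host["cmdb_id"] == "db_003"]
off = db[(db["name"] == "On_Off_State") & (db["value"] == 0)]
timeouts = db[(db["name"] == "tnsping_result_time") & (db["value"] == 9999)]
print("DB off at:", off["timestamp"].tolist())
print("tnsping timeouts at:", timeouts["timestamp"].tolist())
```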
Analysis:
Since we know that db_003 has a failure and that all traceIds begin from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs - (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022
The graphs' time periods are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.
From the graphs, we can tell that:
- os_021 experienced (1) lower elapsed time from 04:28:00 to 04:37:00, and (2) higher elapsed time from 04:37:00 to 04:45:00
- os_022 experienced (1) lower elapsed time from 04:28:00 to 04:37:00, (2) a spike in elapsed time around 04:29:00 to 04:30:00, and (3) higher elapsed time from 04:37:00 to 04:45:00

os_022's observation (1) corresponds very much with the os_021 graph, with the fluctuations matching closely.
Similar to Failure #2's Trace Indicators Explanation, we will try to compare specific databases
db_003 appears to be mapped with local_method_011 to local_method_017 (excluding local_method_012), therefore mapped with docker_005 to docker_008. db_009 appears to be mapped with local_method_001 to local_method_010 (excluding local_method_008), therefore mapped with docker_001 to docker_004.
Therefore, this will shape the logic subsequently, when we compare docker metrics across containers to detect anomalies.
Subsequently, we sampled 100 traceIds (since 100 is a large enough sample for approximation) from both the train and test data to:

1. Compute the average actual time for each unique identifier (cmdb_id:serviceName:dsName)
2. Filter for docker_006, docker_007 and docker_008 in the unique_identifier (since they share similar metrics and trace calls)

Note that docker_006 is from the train data, and therefore faultless, while docker_007 and docker_008 are from the test data.
From the output graph of the above step, we notice that:
- If db_003 has a fault, the invocation trace length will be shorter (2 instead of 11). This means that the invocation stopped, and possibly an error message was returned.
- If db_003 has a fault, there will be a large difference in actual time for None:db_003. This can be understood as the time taken to query the database being longer than the timeout duration.

What this means to the developer: Code review should be performed to understand the cause of the database failure, and to check whether the aftermath effects are intended and errors are properly handled.
What this means to site reliability engineer / ML engineer:
We can potentially detect database failure by observing the None:db_003 metrics for a shorter actual time. Additionally, we can use the trace invocation length as another indicator of failure.
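A sketch of using the trace invocation length as an indicator (`train_df` / `test_df` are assumed names for the trace DataFrames):

```python
# Compare the number of spans per trace between train and test.
train_len = train_df.groupby("traceId").size()
test_len = test_df.groupby("traceId").size()
print("median trace length (train):", train_len.median())
print("median trace length (test):", test_len.median())
# Unusually short traces (e.g. 2 spans instead of 11) suggest the call chain aborted early.
print(test_len[test_len <= 2].head())
```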
Finding the root cause for the slower processing speed
| | unique_identifier | train_actual_time | test_actual_time | difference |
|---|---|---|---|---|
| 46 | os_021:osb_001:nan | 14.33 | 17.23 | -2.90 |
| 18 | docker_004:csf_002:nan | 3.38 | 4.44 | -1.06 |
| 1 | docker_003:csf_002:nan | 4.00 | 4.20 | -0.20 |
| 22 | docker_004:fly_remote_001:nan | 3.03 | 2.10 | 0.93 |
| 5 | docker_003:fly_remote_001:nan | 3.56 | 2.59 | 0.97 |
| 45 | os_021:csf_001:nan | 8.49 | 2.79 | 5.70 |
| 0 | docker_003:csf_001:nan | 16.78 | 8.36 | 8.42 |
| 17 | docker_004:csf_001:nan | 16.77 | 6.96 | 9.81 |
| 2 | docker_003:csf_003:nan | 2.24 | NaN | NaN |
| 3 | docker_003:csf_004:nan | 2.63 | NaN | NaN |
| 4 | docker_003:csf_005:nan | 3.00 | NaN | NaN |
| 6 | docker_003:local_method_001:db_009 | 0.44 | NaN | NaN |
| 7 | docker_003:local_method_002:db_009 | 2.10 | NaN | NaN |
| 8 | docker_003:local_method_003:db_009 | 1.23 | NaN | NaN |
| 9 | docker_003:local_method_004:db_009 | 5.06 | NaN | NaN |
| 10 | docker_003:local_method_005:db_009 | 0.70 | NaN | NaN |
| 11 | docker_003:local_method_006:db_009 | 2.39 | NaN | NaN |
| 12 | docker_003:local_method_007:db_009 | 2.26 | NaN | NaN |
| 13 | docker_003:local_method_009:db_009 | 1.43 | NaN | NaN |
| 14 | docker_003:local_method_010:db_007 | 2.30 | NaN | NaN |
| 15 | docker_003:nan:db_007 | 4.18 | NaN | NaN |
| 16 | docker_003:nan:db_009 | 5.73 | NaN | NaN |
| 19 | docker_004:csf_003:nan | 3.57 | NaN | NaN |
| 20 | docker_004:csf_004:nan | 4.48 | NaN | NaN |
| 21 | docker_004:csf_005:nan | 3.42 | NaN | NaN |
| 23 | docker_004:local_method_001:db_009 | 0.51 | NaN | NaN |
| 24 | docker_004:local_method_002:db_009 | 2.47 | NaN | NaN |
| 25 | docker_004:local_method_003:db_009 | 0.64 | NaN | NaN |
| 26 | docker_004:local_method_004:db_009 | 3.84 | NaN | NaN |
| 27 | docker_004:local_method_005:db_009 | 1.87 | NaN | NaN |
| 28 | docker_004:local_method_006:db_009 | 2.97 | NaN | NaN |
| 29 | docker_004:local_method_007:db_009 | 2.48 | NaN | NaN |
| 30 | docker_004:local_method_009:db_009 | 1.28 | NaN | NaN |
| 31 | docker_004:local_method_010:db_007 | 2.66 | NaN | NaN |
| 32 | docker_004:nan:db_007 | 4.76 | NaN | NaN |
| 33 | docker_004:nan:db_009 | 7.70 | NaN | NaN |
| 34 | docker_006:csf_002:nan | 26.60 | NaN | NaN |
| 35 | docker_006:csf_003:nan | 14.76 | NaN | NaN |
| 36 | docker_006:csf_004:nan | 14.70 | NaN | NaN |
| 37 | docker_006:csf_005:nan | 12.31 | NaN | NaN |
| 38 | docker_006:local_method_011:db_003 | 2.26 | NaN | NaN |
| 39 | docker_006:local_method_013:db_003 | 3.58 | NaN | NaN |
| 40 | docker_006:local_method_014:db_003 | 1.93 | NaN | NaN |
| 41 | docker_006:local_method_015:db_003 | 3.99 | NaN | NaN |
| 42 | docker_006:local_method_016:db_003 | 2.07 | NaN | NaN |
| 43 | docker_006:local_method_017:db_003 | 3.08 | NaN | NaN |
| 44 | docker_006:nan:db_003 | 26.82 | NaN | NaN |
| 47 | docker_007:csf_002:nan | NaN | 11.91 | NaN |
| 48 | docker_007:local_method_011:db_003 | NaN | 2.56 | NaN |
| 49 | docker_008:csf_002:nan | NaN | 10.19 | NaN |
| 50 | docker_008:local_method_011:db_003 | NaN | 2.26 | NaN |
Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled decreased around the failure time period. This unusual behaviour can be spotted by comparing the system against itself across the surrounding hour, as well as against the train data (normal data with no failures) over the same time period. This indicates that failure #1 does not cause service calls to fail outright, but it does lead to side effects such as longer call times and fewer service requests handled. This phenomenon could be due to the app's error handling or simply to poorer load balancing.
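The per-minute aggregates behind these charts can be recomputed from the raw spans. A sketch covering the three quantities discussed (call time, request volume, success rate), assuming `test_trace` as loaded earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

test_trace["startTime"] = pd.to_datetime(test_trace["startTime"])
per_minute = test_trace.groupby(test_trace["startTime"].dt.floor("min")).agg(
    avg_elapsed=("elapsedTime", "mean"),   # time to process each service call
    n_calls=("elapsedTime", "size"),       # service requests handled
    success_rate=("success", "mean"),      # remains at 1 throughout failure #1
)
per_minute.plot(subplots=True, figsize=(10, 6))
plt.show()
```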
FYI: From cmdb.xlsx, docker_005 is also mapped to os_017. However, we will only check the host data in dcos_docker.csv.
Analysis:
Trying out all possible metric names for docker_005, we note the following observations (a sketch of this scan follows the list):
- container_cpu_used -> although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (i.e. docker_007) in the test data
- container_mem_used -> although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (i.e. docker_007) in the test data
- container_fgc -> all values are 0
- container_fgct -> all values are 0
- container_session_used -> all values are 0
- container_thread_idle -> all values are 0
- container_thread_running -> all values are 0
- container_thread_total -> all values are 0
- container_thread_used_pct -> all values are 0
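A sketch of this metric scan; the (cmdb_id, name, value) layout of dcos_docker.csv is assumed from the dataset's KPI format:

```python
import pandas as pd

host = pd.read_csv("dcos_docker.csv")
d5 = host[host["cmdb_id"] == "docker_005"]

# Report, for every metric name recorded for docker_005, whether it is
# constantly zero (as observed for the container_thread_* and fgc metrics).
for metric, grp in d5.groupby("name"):
    status = "all zero" if (grp["value"] == 0).all() else "has non-zero values"
    print(f"{metric}: {status}")
```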
Analysis:
Since we know that docker_005 has a failure and that every traceId begins from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs: (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022. A sketch of how these root spans can be extracted follows.
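A sketch of the parent (root) spans behind graphs (2) and (3), reusing the parsed startTime from the earlier per-minute sketch:

```python
import matplotlib.pyplot as plt

# Root spans are those whose pid never appears as an id anywhere in the data:
# these are the "parent processes" that start each trace.
roots = test_trace[~test_trace["pid"].isin(set(test_trace["id"]))]

for host_id in ("os_021", "os_022"):
    sub = roots[roots["cmdb_id"] == host_id].sort_values("startTime")
    plt.plot(sub["startTime"], sub["elapsedTime"], label=host_id)
plt.legend()
plt.ylabel("root span elapsedTime (ms)")
plt.show()
```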
The time periods of these graphs are carefully chosen to ensure that calls with long elapsed times are shown and that there is ample data before and after to ascertain the trend.
From the graphs, we can tell that:
- os_021 experienced (1) a reduction in service call frequency, (2) lower elapsed time
- os_022 experienced significantly higher elapsed time

Similar to Failure #2's Trace Indicators Explanation, we will compare specific dockers:
docker_005 is most similar to docker_006 through docker_008. This will shape the subsequent logic, when we compare the docker metrics against one another to detect anomalies.
Subsequently, we sampled 100 traceIds (since 100 is a large sample size for approximation) from both the train and test data to:
- compute the average actual time for each unique_identifier (cmdb_id:serviceName:dsName)
- filter for docker_005 and docker_006 in the unique_identifier (docker_007 and docker_008 are not as relevant as docker_006, which is the non-faulty docker in similar circumstances / system), since they share similar metrics and trace calls

From the output graph of the above step, we notice that:
- None:db_003 and csf_002:None have the largest differences between docker_005 and docker_006

Therefore, since db_003 and csf_002 are outer spans, we can conclude that the failure of docker_005 can be attributed to the large increase in the time taken to pass control to the other microservices called by csf_002 and to the database. A pivot sketch of this comparison follows.
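A sketch of the docker_005 vs docker_006 comparison, building on the `avg_actual_time` helper sketched earlier (`test_avg` being its output on the test data):

```python
# Split the unique_identifier back into its parts and pivot by cmdb_id so the
# two dockers' averaged actual times sit side by side per span.
cmp = test_avg.rename("actual_time").reset_index()
cmp.columns = ["unique_identifier", "actual_time"]
cmp[["cmdb_id", "serviceName", "dsName"]] = cmp["unique_identifier"].str.split(":", expand=True)

pivot = (cmp[cmp["cmdb_id"].isin(["docker_005", "docker_006"])]
         .pivot(index=["serviceName", "dsName"], columns="cmdb_id", values="actual_time"))
pivot["diff"] = pivot["docker_005"] - pivot["docker_006"]
# Per the table below, nan:db_003 and csf_002:nan surface with the largest gaps.
print(pivot.sort_values("diff", ascending=False))
```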
What this means to the developer:
A code review should be performed on docker_005:csf_002:None to understand the cause of the increased time taken to call other microservices.
What this means to the site reliability engineer / ML engineer:
We can potentially detect docker failures (for docker_005 through docker_008) by observing the None:db_003 and csf_002:None metrics for large increases in actual time. A sketch of such an alert rule follows.
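One way to operationalise this takeaway as an alert rule; the indicator keys follow the serviceName:dsName convention above, and the 1.5x ratio is an illustrative assumption, not a value tuned in this report:

```python
# Alert when a recent window's mean actual time for either indicator span
# drifts well above the training baseline for the same docker.
INDICATORS = ("nan:db_003", "csf_002:nan")

def should_alert(window_avg: dict, baseline_avg: dict, ratio: float = 1.5) -> bool:
    """Both arguments map 'serviceName:dsName' -> mean actual time."""
    return any(
        window_avg.get(k, 0.0) > ratio * baseline_avg.get(k, float("inf"))
        for k in INDICATORS
    )

# e.g. docker_005's nan:db_003 at ~16 vs a baseline of ~9 -> alert
print(should_alert({"nan:db_003": 16.0}, {"nan:db_003": 9.0}))  # True
```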
Finding the root cause for the slower processing speed
| | unique_identifier | train_actual_time | test_actual_time | difference |
|---|---|---|---|---|
| 4 | docker_001:csf_005:nan | 6.57 | 842.92 | -836.35 |
| 2 | docker_001:csf_003:nan | 7.21 | 795.59 | -788.38 |
| 21 | docker_002:csf_005:nan | 3.63 | 786.65 | -783.02 |
| 1 | docker_001:csf_002:nan | 8.32 | 782.03 | -773.71 |
| 0 | docker_001:csf_001:nan | 27.15 | 789.79 | -762.64 |
| 3 | docker_001:csf_004:nan | 7.90 | 709.82 | -701.92 |
| 18 | docker_002:csf_002:nan | 6.26 | 706.31 | -700.05 |
| 20 | docker_002:csf_004:nan | 6.13 | 682.07 | -675.94 |
| 19 | docker_002:csf_003:nan | 5.17 | 671.45 | -666.28 |
| 17 | docker_002:csf_001:nan | 27.61 | 471.95 | -444.34 |
| 57 | os_022:osb_001:nan | 15.60 | 110.88 | -95.28 |
| 56 | os_022:csf_001:nan | 9.58 | 35.59 | -26.01 |
| 22 | docker_002:fly_remote_001:nan | 2.92 | 8.16 | -5.24 |
| 5 | docker_001:fly_remote_001:nan | 2.29 | 5.73 | -3.44 |
| 24 | docker_002:local_method_002:db_009 | 1.80 | 3.44 | -1.64 |
| 28 | docker_002:local_method_006:db_009 | 2.08 | 3.46 | -1.38 |
| 32 | docker_002:nan:db_007 | 3.18 | 4.47 | -1.29 |
| 12 | docker_001:local_method_007:db_009 | 1.47 | 2.10 | -0.63 |
| 15 | docker_001:nan:db_007 | 2.93 | 3.44 | -0.51 |
| 27 | docker_002:local_method_005:db_009 | 2.11 | 2.20 | -0.09 |
| 29 | docker_002:local_method_007:db_009 | 0.64 | 0.58 | 0.06 |
| 13 | docker_001:local_method_009:db_009 | 0.91 | 0.84 | 0.07 |
| 10 | docker_001:local_method_005:db_009 | 0.85 | 0.57 | 0.28 |
| 31 | docker_002:local_method_010:db_007 | 1.83 | 1.53 | 0.30 |
| 25 | docker_002:local_method_003:db_009 | 0.82 | 0.50 | 0.32 |
| 7 | docker_001:local_method_002:db_009 | 4.00 | 3.17 | 0.83 |
| 8 | docker_001:local_method_003:db_009 | 1.83 | 0.48 | 1.35 |
| 6 | docker_001:local_method_001:db_009 | 1.76 | 0.40 | 1.36 |
| 14 | docker_001:local_method_010:db_007 | 4.61 | 3.18 | 1.43 |
| 30 | docker_002:local_method_009:db_009 | 3.11 | 1.33 | 1.78 |
| 16 | docker_001:nan:db_009 | 11.42 | 9.45 | 1.97 |
| 23 | docker_002:local_method_001:db_009 | 2.61 | 0.37 | 2.24 |
| 33 | docker_002:nan:db_009 | 9.51 | 6.86 | 2.65 |
| 11 | docker_001:local_method_006:db_009 | 4.65 | 1.05 | 3.60 |
| 26 | docker_002:local_method_004:db_009 | 6.30 | 1.88 | 4.42 |
| 9 | docker_001:local_method_004:db_009 | 9.00 | 4.16 | 4.84 |
| 34 | docker_007:csf_002:nan | 12.29 | NaN | NaN |
| 35 | docker_007:csf_003:nan | 8.82 | NaN | NaN |
| 36 | docker_007:csf_004:nan | 8.27 | NaN | NaN |
| 37 | docker_007:csf_005:nan | 6.36 | NaN | NaN |
| 38 | docker_007:local_method_011:db_003 | 0.96 | NaN | NaN |
| 39 | docker_007:local_method_013:db_003 | 1.69 | NaN | NaN |
| 40 | docker_007:local_method_014:db_003 | 0.94 | NaN | NaN |
| 41 | docker_007:local_method_015:db_003 | 1.68 | NaN | NaN |
| 42 | docker_007:local_method_016:db_003 | 0.95 | NaN | NaN |
| 43 | docker_007:local_method_017:db_003 | 1.12 | NaN | NaN |
| 44 | docker_007:nan:db_003 | 14.30 | NaN | NaN |
| 45 | docker_008:csf_002:nan | 14.21 | NaN | NaN |
| 46 | docker_008:csf_003:nan | 5.08 | NaN | NaN |
| 47 | docker_008:csf_004:nan | 5.12 | NaN | NaN |
| 48 | docker_008:csf_005:nan | 5.91 | NaN | NaN |
| 49 | docker_008:local_method_011:db_003 | 1.07 | NaN | NaN |
| 50 | docker_008:local_method_013:db_003 | 1.04 | NaN | NaN |
| 51 | docker_008:local_method_014:db_003 | 0.71 | NaN | NaN |
| 52 | docker_008:local_method_015:db_003 | 0.91 | NaN | NaN |
| 53 | docker_008:local_method_016:db_003 | 0.63 | NaN | NaN |
| 54 | docker_008:local_method_017:db_003 | 1.36 | NaN | NaN |
| 55 | docker_008:nan:db_003 | 11.54 | NaN | NaN |
| 58 | docker_005:csf_002:nan | NaN | 14.72 | NaN |
| 59 | docker_005:csf_003:nan | NaN | 6.78 | NaN |
| 60 | docker_005:csf_004:nan | NaN | 6.85 | NaN |
| 61 | docker_005:csf_005:nan | NaN | 6.86 | NaN |
| 62 | docker_005:local_method_011:db_003 | NaN | 0.96 | NaN |
| 63 | docker_005:local_method_013:db_003 | NaN | 1.09 | NaN |
| 64 | docker_005:local_method_014:db_003 | NaN | 0.70 | NaN |
| 65 | docker_005:local_method_015:db_003 | NaN | 1.13 | NaN |
| 66 | docker_005:local_method_016:db_003 | NaN | 0.73 | NaN |
| 67 | docker_005:local_method_017:db_003 | NaN | 1.19 | NaN |
| 68 | docker_005:nan:db_003 | NaN | 16.03 | NaN |
| 69 | docker_006:csf_002:nan | NaN | 9.88 | NaN |
| 70 | docker_006:csf_003:nan | NaN | 4.49 | NaN |
| 71 | docker_006:csf_004:nan | NaN | 5.17 | NaN |
| 72 | docker_006:csf_005:nan | NaN | 3.23 | NaN |
| 73 | docker_006:local_method_011:db_003 | NaN | 0.79 | NaN |
| 74 | docker_006:local_method_013:db_003 | NaN | 0.77 | NaN |
| 75 | docker_006:local_method_014:db_003 | NaN | 0.52 | NaN |
| 76 | docker_006:local_method_015:db_003 | NaN | 0.80 | NaN |
| 77 | docker_006:local_method_016:db_003 | NaN | 0.46 | NaN |
| 78 | docker_006:local_method_017:db_003 | NaN | 0.58 | NaN |
| 79 | docker_006:nan:db_003 | NaN | 9.05 | NaN |